library(tidyverse)
library(DT)
library(pander)
library(readr)
library(car)
library(plotly)

HSS <- read_csv("../../Data/HighSchoolSeniors.csv")
#Remember: select "Session, Set Working Directory, To Source File Location", and then play this R-chunk into your console to read the HSS data into R. 

Hypothesis

Does the Mean Value of Homework Hours Match Up With Computer Use Hours?

Nowadays an extensive amount of homework is done on computers rather than with the traditional pencil and paper. I want to know whether the time students spend on computers is equal to the time they spend doing homework.

Formally, the null and alternative hypotheses are written as \[ H_0: \mu_\text{(hours doing homework - hours on computer)} = 0 \] \[ H_a: \mu_\text{(hours doing homework - hours on computer)} \neq 0 \]

The significance level for this study will be set at \[ \alpha = 0.05 \]

Quick Peek at the Data Table

editHSS <- HSS %>% 
  select('DataYear', 'Gender', 'Ageyears', 'Doing_Homework_Hours', 'Computer_Use_Hours') %>% 
  filter(Doing_Homework_Hours + Computer_Use_Hours < 168)

datatable(editHSS, options=list(lengthMenu = c(3,10,30),scrollY=300,scroller=TRUE,scrollX=TRUE), 
            extensions="Scroller")

The data table above gives a quick glimpse of the time spent on homework compared with the time spent on the computer. It is taken from a larger data set that resulted from a survey given to high school students, and it shows only a few columns that are good to be familiar with. The table is also filtered to take out exaggerations: we know most high school students have a strong dislike for homework and would likely exaggerate the amounts. To filter these out, I only keep rows where homework time and computer time together add up to fewer than 168 hours (168 is the number of hours in a week).
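As a minimal sketch of what that filter does, the base R snippet below applies the same condition to a tiny made-up data frame. The values in `toy` are hypothetical and are not rows from the survey. `subset()` stands in for `dplyr::filter()` here; both drop rows where the condition evaluates to `NA`.

```r
# Toy data frame with made-up values (NOT rows from the HSS survey).
toy <- data.frame(
  Doing_Homework_Hours = c(5, 100, 10, NA),
  Computer_Use_Hours   = c(20, 90, 160, 12)
)

# Keep rows whose combined hours are below the 168 hours in a week.
# Rows where the sum is NA are dropped as well.
kept <- subset(toy, Doing_Homework_Hours + Computer_Use_Hours < 168)

print(kept)
print(nrow(toy) - nrow(kept))  # number of rows removed by the filter
```

Note that the third toy row is removed even though each of its values is below 168 on its own, because the condition is applied to the sum of the two columns.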

Quantile-Quantile Graph

par(mfrow = c(1, 2))  # two plots side by side
qqPlot(editHSS$Doing_Homework_Hours, id = FALSE)
qqPlot(editHSS$Computer_Use_Hours, id = FALSE)

From the graphs above we can see that these columns are not normally distributed. Even so, we can go ahead with our t-test because of the central limit theorem: the sampling distribution of a sample mean is approximately normal when the sample is large (a common rule of thumb is more than 30 observations), and this sample is far larger than that. For a paired test, the quantity that needs to be approximately normal is the mean of the differences.
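The central limit theorem argument can be illustrated with a small simulation. This is only a sketch under stated assumptions: `draw_diffs()` is a hypothetical helper that generates left-skewed differences from two exponential distributions, not the survey data, and `skewness()` is a simple hand-written sample skewness. The point is that the individual differences are strongly skewed while means of 424 of them (the number of pairs implied by the reported 423 degrees of freedom) are close to symmetric.

```r
set.seed(42)  # make the simulation reproducible

# Hypothetical left-skewed "differences" (NOT the HSS data).
draw_diffs <- function(n) rexp(n, rate = 1 / 6) - rexp(n, rate = 1 / 23)

# Sample skewness: close to 0 for a symmetric distribution such as the normal.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

n <- 424  # number of pairs implied by df = 423

raw_skew  <- skewness(draw_diffs(100000))                    # individual values
mean_skew <- skewness(replicate(2000, mean(draw_diffs(n))))  # sample means

print(c(raw = raw_skew, sample_means = mean_skew))
```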

Histogram of the Differences

fig <- plot_ly(alpha = 0.6)
fig <- fig %>% add_histogram(x = ~(editHSS$Doing_Homework_Hours- editHSS$Computer_Use_Hours))
fig <- fig %>% layout(barmode = "stack", xaxis = list(title = 'Homework Hours minus Computer Hours (per Week)'), title="Difference Between Homework Hours and Computer Use Hours")
fig

Looking over the histogram, we can see that it is skewed to the left and that the bulk of students seem to spend more hours on a computer than doing homework.

Numerical Summary

#editHSS %>%
#  summarise(Min = min(Doing_Homework_Hours),
#            Q1 = quantile(Doing_Homework_Hours, .25),
#            Mean = mean(Doing_Homework_Hours),
#            Median = median(Doing_Homework_Hours),
#            Q3 = quantile(Doing_Homework_Hours, .75),
#            Max = max(Doing_Homework_Hours)) %>% pander()

#editHSS %>%
#  summarise(Min = min(Computer_Use_Hours),
#            Q1 = quantile(Computer_Use_Hours, .25),
#            Mean = mean(Computer_Use_Hours),
#            Median = median(Computer_Use_Hours),
#            Q3 = quantile(Computer_Use_Hours, .75),
#            Max = max(Computer_Use_Hours)) %>% pander()

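| Hours    | Min | Q1   | Mean  | Median | Q3 | Max |
|----------|-----|------|-------|--------|----|-----|
| Homework | 0   | 1.75 | 6.079 | 4      | 8  | 48  |
| Computer | 0   | 6    | 23.22 | 15     | 30 | 160 |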

The numerical summary above shows several summary statistics for each column, highlighting the differences in spread as well as in center. Looking at the means, we can see quite a difference between them: considerably more time is spent on the computer (23.22 hours) than on homework (6.079 hours). We will use a t-test to check whether this observed difference is statistically significant.

t-Test

For this hypothesis we will be using a paired-samples t-test, because each student reports both homework hours and computer use hours.

 t.test(editHSS$Doing_Homework_Hours, editHSS$Computer_Use_Hours, alternative='two.sided', paired=TRUE) %>% pander()
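Paired t-test: editHSS$Doing_Homework_Hours and editHSS$Computer_Use_Hours

| Test statistic | df  | P value         | Alternative hypothesis | mean difference |
|----------------|-----|-----------------|------------------------|-----------------|
| -14.14         | 423 | 1.941e-37 * * * | two.sided              | -17.14          |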
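A paired t-test is the same as a one-sample t-test on the per-student differences, with test statistic \( t = \bar{d} / (s_d / \sqrt{n}) \). The sketch below checks this equivalence on simulated data; the vectors `homework` and `computer` are hypothetical values generated for illustration, not the survey columns.

```r
set.seed(1)  # make the simulation reproducible

# Simulated paired measurements (hypothetical, NOT the HSS data).
homework <- rexp(50, rate = 1 / 6)
computer <- rexp(50, rate = 1 / 23)
d <- homework - computer  # one difference per simulated student

paired  <- t.test(homework, computer, alternative = "two.sided", paired = TRUE)
one_smp <- t.test(d, mu = 0)                    # one-sample test on differences
t_hand  <- mean(d) / (sd(d) / sqrt(length(d)))  # statistic computed by hand

# All three statistics should agree.
print(c(paired     = unname(paired$statistic),
        one_sample = unname(one_smp$statistic),
        by_hand    = t_hand))
```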

Conclusion

The obtained p-value is far less than 0.05 (\(p = 1.941 \times 10^{-37} < \alpha\)), so there is sufficient evidence to reject the null hypothesis. The test statistic of -14.14 is a long way from zero, in the direction of more time spent on computers. The mean difference is -17.14, showing that on average students spend 17.14 more hours per week on their computer than doing homework. From this we can conclude that the mean difference is not equal to zero and that the result is statistically significant.
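To give a sense of the size of the difference, and not only its significance, the reported output can be turned into a standard error, an approximate 95% confidence interval, and a standardized mean difference. This is a rough sketch that works only from the rounded values in the t-test table above (test statistic, df, mean difference), so the results are approximate; running `t.test()` on the data itself would give the exact interval.

```r
# Values copied from the reported paired t-test output (already rounded).
t_stat    <- -14.14
df        <- 423
mean_diff <- -17.14
n         <- df + 1  # number of pairs

se   <- mean_diff / t_stat   # standard error of the mean difference
sd_d <- se * sqrt(n)         # implied standard deviation of the differences

# Approximate 95% confidence interval for the mean difference.
ci <- mean_diff + c(-1, 1) * qt(0.975, df) * se

# Standardized mean difference (mean difference in SD units).
d_z <- abs(mean_diff) / sd_d

print(round(c(se = se, sd = sd_d, lower = ci[1], upper = ci[2], d = d_z), 2))
```

Under these assumptions the interval runs from roughly -19.5 to -14.8 hours, which does not include zero and so agrees with the decision to reject the null hypothesis.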